Abstract

By using a mixture model for the density distribution of the three pseudobond angles formed by Cα
atoms of four consecutive residues, the local structural states are discretized as 17 conformational letters
of a protein structural alphabet. This coarse-graining procedure converts a 3D structure to a 1D code
sequence. A substitution matrix between these letters is constructed based on the structural alignments
of the FSSP database.

1 Introduction

Drastic approximations are unavoidable in the prediction of protein structure from the amino acid sequence.
Most local structure prediction methods use three secondary structure states: helix, strand and loop.
However, segments of a single secondary structure may vary significantly in their 3D structures. A refined
objective classification of segments may enhance our ability to predict structures, and deepen our
understanding of the modular architecture of proteins.

The usual approaches simplify protein structure by modelling proteins as chains of one or two interacting
centers representing individual amino acids, and adopt only a small number of discrete conformational states.
Many studies of the classification of protein fragments use the backbone (φ, ψ) dihedral angles,
or angles of Cα pseudobonds, or distances derived from the positions of Cα atoms. Due to the anticorrelation
between φ and ψ (McCammon et al., 1977; Flocco and Mowbray, 1995), there may be instances where a
big change in both φ and ψ does not represent an obvious change in the Cα pseudobond angles, but a
reorientation of the peptide in question. Furthermore, the relation between Cα coordinates and pseudobond
angles is rather straightforward, and pseudobond angles have a more direct geometric meaning than distances.
We shall use only pseudobond angles in this paper.

By restricting the local conformations of individual residues to a handful of states, one can discretize
protein conformation to convert the 3D structure of a backbone to a 1D sequence of these discrete states,
akin to the amino acid sequence. Prediction of protein structure depends on the accuracy and complexity of
the models used. A model must be as simple as possible to reduce the conformational space to be searched
for a correct conformation, yet a model of low complexity tends to have a lower accuracy. A model must
represent the actual geometry of protein conformations accurately enough, yet a complex model is prone to
over-fitting the observed data.

Generally, the procedure to deduce finite discrete conformational states from a continuous conformational
phase space is a clustering analysis. There have been a variety of different ways of clustering. For example,
Park and Levitt (1995) represent the polypeptide chain by a sequence of rigid fragments that are chosen
from a library of representative fragments, and concatenated without any degrees of freedom. The average
deviation of the global-fit approximations over the training set is taken as the objective function for optimizing
the finite representative fragments. The state clusters there are representative points of the phase space.
Rooman, Kocher and Wodak (1991) intuitively divide the φ-ψ space into 6 regions, which correspond to
a partitioning based on the Ramachandran plot. Standard methods for clustering analysis have also been
used to generate discrete structure states (Bystroff and Baker, 1998).

Hidden Markov models (HMMs; Rabiner, 1989), possessing a rigorous but flexible mathematical structure,
have been used in a variety of computational biology problems such as sequence motif recognition
(Fujiwara et al., 1994), gene finding (Burge and Karlin, 1997), protein secondary structure prediction (Asai,
Hayamizu and Handa, 1993; Zheng, 2004), and multiple sequence alignments (Krogh et al., 1994).
HMMs have also been used for identifying the modular framework of the protein backbone (Edgoose, Allison
and Dowe, 1998; Camproux et al., 1999). In these HMMs conformational states are represented by probability
distributions, which is much finer than a simple partition of the phase space. HMMs also take into account
the sequential connections between conformational states, and hence involve a large number of parameters,
which makes the model training a tough task. Furthermore, it is not so convenient to assign structure codes
to a short segment with HMMs.

Here we develop a description of protein backbone tertiary structure using pseudobond angles of successive
Cα atoms. Finite conformational states as a structural alphabet are selected according to the density
peaks of the probability distribution in the phase space spanned by pseudobond angles, and their feasibility for
characterizing short-segment polypeptide backbone conformation is examined. In order to use the structural
codes in structural comparison, we derive a substitution matrix of these conformational states from a
representative pairwise aligned structure set of the FSSP (families of structurally similar proteins) database
of Holm and Sander (1994).



2 Methods 

Among the various abstract representations of protein 3D structure, a frequently encountered one is
the protein virtual backbone. The Cα atom of each residue is chosen as its representative point. In this
representation, two adjacent residues in a protein sequence are virtually bonded, forming a pseudobond.



2.1 Pseudo-bond angles 

The virtual bond bending angle θ defined for three contiguous points (a, b, c) is the angle between the vectors
r_ab = r_b − r_a and r_bc, i.e. cos θ = r_ab · r_bc / (|r_ab| |r_bc|). The range of θ is [0, π]. The virtual bond torsion angle τ
defined for four contiguous points (a, b, c, d) is the dihedral angle between the planes abc and bcd. The range
of τ is (−π, π], and its sign is the same as that of (r_ab × r_bc) · r_cd. In fact, we may adopt a wider range of τ under the
equivalence relation that τ_1 and τ_2 are equivalent if τ_1 = τ_2 (mod 2π). For the four-residue segment abcd,
by taking a as the origin, b on the x-axis, and c on the xy-plane, the number of independent relative
coordinates is 6. The assumption of a fixed pseudobond length, which is 3.8 Å for the dominant trans
peptide, further reduces the number of degrees of freedom to 3. These independent coordinates correspond
to the angles (θ_abc, τ_abcd, θ_bcd). Elongating the segment by one residue e adds two more angles, τ_bcde and
θ_cde. Generally, for a sequence of n residues, we have n − 2 bending angles and n − 3 torsion angles, 2n − 5 in
total. We shall assign the angle pair (τ_abcd, θ_bcd) = (τ_c, θ_c) to residue c, the third of the four-residue segment.
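
The two angle definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`bending`, `torsion`, `angle_triples`) are hypothetical, points are plain coordinate triples, and the torsion is evaluated with the usual `atan2` form, whose sign agrees with (r_ab × r_bc) · r_cd.

```python
import math

def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])

def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]

def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def bending(a, b, c):
    """Angle between r_ab and r_bc, in [0, pi]."""
    rab, rbc = _sub(b, a), _sub(c, b)
    cos_theta = _dot(rab, rbc) / math.sqrt(_dot(rab, rab) * _dot(rbc, rbc))
    return math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp rounding noise

def torsion(a, b, c, d):
    """Dihedral between planes abc and bcd; sign follows (r_ab x r_bc) . r_cd."""
    rab, rbc, rcd = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(rab, rbc), _cross(rbc, rcd)
    y = math.sqrt(_dot(rbc, rbc)) * _dot(rab, n2)
    x = _dot(n1, n2)
    return math.atan2(y, x)

def angle_triples(chain):
    """One (theta, tau, theta') triple per window of four consecutive points."""
    return [(bending(*chain[k:k + 3]),
             torsion(*chain[k:k + 4]),
             bending(*chain[k + 1:k + 4]))
            for k in range(len(chain) - 3)]
```

With this convention a planar zigzag (trans) arrangement of four points gives τ = π and a planar cis arrangement gives τ = 0.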

Bending and torsion angles of a chain correspond to the curvature and torsion of a curve. The relative
coordinates of the chain {r_0, r_1, ..., r_n} can be recovered from their 2n − 5 angles {θ_1; τ_2, θ_2; ...; τ_{n−1}, θ_{n−1}}.
By convention, we set the origin at r_0, put r_1 along the x-axis, and add τ_1 = 0. Introducing the rotation
matrices R_θ and R_τ (with respect to the z- and x-axis, respectively)

R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta & 0 \\ \sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
R_\tau = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\tau & -\sin\tau \\ 0 & \sin\tau & \cos\tau \end{pmatrix}, \quad \text{and} \quad
d = r_1 = \begin{pmatrix} |r_1| \\ 0 \\ 0 \end{pmatrix},    (1)

the position r_k is determined by

T_0 = I, \quad r_0 = 0 \cdot d, \quad T_k = T_{k-1} R_{\tau_k} R_{\theta_k}, \quad d_k = T_{k-1} d, \quad r_k = r_{k-1} + d_k, \quad k \ge 1,    (2)

where I is the identity matrix.
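
As a rough illustration of the recursion in (2), the sketch below rebuilds chain coordinates from a list of (τ_k, θ_k) pairs. It assumes a constant pseudobond length (3.8 by default) and uses hypothetical helper names (`rot_z`, `rot_x`, `build_chain`); it is not the authors' code.

```python
import math

def _matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def _matvec(A, v):
    return [sum(A[i][k] * v[k] for k in range(3)) for i in range(3)]

def rot_z(theta):
    """R_theta: rotation about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(tau):
    """R_tau: rotation about the x-axis."""
    c, s = math.cos(tau), math.sin(tau)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def build_chain(angle_pairs, bond=3.8):
    """angle_pairs[k-1] = (tau_k, theta_k) for k = 1..m, with tau_1 = 0 by convention.

    Returns the m + 2 points r_0 .. r_{m+1}; r_0 is the origin, r_1 lies on the x-axis.
    """
    d = [bond, 0.0, 0.0]
    T = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # T_0 = I
    chain = [[0.0, 0.0, 0.0], list(d)]                        # r_0 and r_1 = d
    for tau, theta in angle_pairs:
        T = _matmul(T, _matmul(rot_x(tau), rot_z(theta)))     # T_k = T_{k-1} R_tau R_theta
        step = _matvec(T, d)                                  # d_{k+1} = T_k d
        chain.append([chain[-1][i] + step[i] for i in range(3)])
    return chain
```

For example, `build_chain([(0.0, math.pi / 2), (math.pi / 2, math.pi / 2)], bond=1.0)` ends at the point (1, 1, 1), a chain whose bending angles and torsion are all π/2.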
Longer fragments will include more correlation than shorter fragments. However, the complexity that
can be explored with longer fragment lengths is limited severely by the relatively small number of known
protein structures, and a larger number of discrete states has to be determined for a longer segment. The
minimal unit in which the relative coordinates fix the angles and vice versa is a segment of four contiguous residues.
We shall concentrate mainly on the structure codes for the four-residue unit.

2.2 The mixture model for the angle probability distribution 

The three pseudobond angles (θ, τ, θ') of the four-residue unit span the three-dimensional phase space. Our
classifiers for conformational states are based on the following mixture model M: the probability distribution
of 'points' x = (θ, τ, θ') is given by the mixture of several normal distributions

P(x|M) = \sum_{i=1}^{c} \pi_i N(\mu_i, \Sigma_i),    (3)

where c is the number of normal distribution categories in the mixture, π_i the prior for category i, and
N(μ, Σ) the normal distribution

N(\mu, \Sigma) = (2\pi)^{-3/2} |\Sigma|^{-1/2} \exp[-\tfrac{1}{2} (x - \mu) \cdot \Sigma^{-1} \cdot (x - \mu)].    (4)

Each normal distribution has 6 parameters for its symmetric covariance matrix Σ and three for its mean μ.
Adding one more parameter for the prior of each category, the mixture model has 10c parameters for the
total c categories. (The normalization Σ_i π_i = 1 reduces the number to 10c − 1.) These categories will be
translated as the structure codes.

To determine the number c of categories objectively, we investigate density peaks in the phase space with
the downhill simplex method of Nelder and Mead (1965). The method requires only function evaluations,
not derivatives. It is not very efficient in terms of the number of function evaluations that it requires, but
still works well for our problem here. We use the count of points in a rectangular box as the value of the function to be
optimized at the center of the box. The box size corresponds to the Parzen window width. A large
box size has a low resolution, and hence helps us to focus on the main density peaks in the phase space and to
locate them easily near their real positions. Reducing the box size, we can see more peaks which are less
conspicuous and hence unseen under a larger box size. Too small a box size, making local fluctuations visible,
is often misleading. Missing any important mode will affect the model training and the efficacy of the
structural codes generated. We first search for maximal points of the one-dimensional marginal probability
distributions of θ and τ, and then utilize them to generate a grid in the (θ, τ, θ') space for searching for peaks
in the space.
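
A minimal sketch of the box-counting objective and of the grid used to seed the peak search is given below. It covers only the function evaluation and the choice of starting points; the downhill simplex refinement itself is not included. The names `box_count` and `best_grid_seeds` are hypothetical, and treating τ as periodic when testing box membership is an assumption of this sketch.

```python
import math
from itertools import product

TWO_PI = 2.0 * math.pi

def wrap(delta):
    """Map an angular difference into (-pi, pi]."""
    delta = (delta + math.pi) % TWO_PI - math.pi
    return math.pi if delta == -math.pi else delta

def box_count(points, center, half_widths):
    """Number of (theta, tau, theta') points inside the box centered at `center`."""
    c0, c1, c2 = center
    h0, h1, h2 = half_widths
    return sum(1 for th, ta, th2 in points
               if abs(th - c0) <= h0
               and abs(wrap(ta - c1)) <= h1   # tau is periodic
               and abs(th2 - c2) <= h2)

def best_grid_seeds(points, theta_grid, tau_grid, half_widths, top=3):
    """Score every grid node by its box count and return the `top` densest nodes."""
    scored = [(box_count(points, node, half_widths), node)
              for node in product(theta_grid, tau_grid, theta_grid)]
    scored.sort(key=lambda item: -item[0])
    return scored[:top]
```

Each returned node could then serve as the starting point of a derivative-free local search on `box_count`.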

We also examine density peaks in the five-dimensional phase space spanned by (θ_b, τ_c, θ_c, τ_d, θ_d) of the
five-residue unit abcde to investigate the effect of the angle correlation. A five-angle mode (θ_b, τ_c, θ_c, τ_d, θ_d)
contains two three-angle modes, (θ_b, τ_c, θ_c) and (θ_c, τ_d, θ_d). We require that all the important three-angle
modes implied by the main density peaks in the five-angle phase space be included in the modes used
for the construction of the mixture model.

The main purpose of searching for density peaks is to estimate the number c of categories and the means {μ_i} for
each category. Once this has been done, we may start with some simple {π_i} and {Σ_i}, say π_i = 1/c and
certain diagonal {Σ_i}, and then update the mixture model by the Expectation-Maximization (EM) method
as follows. For each point x_k = (θ_{k−1}, τ_k, θ_k), we calculate the probability for the point to belong to the i-th
category C_i according to the Bayes formula as

P(C_i|x_k) \propto \pi_i P(x_k|C_i) \propto \pi_i |\Sigma_i|^{-1/2} \exp[-\tfrac{1}{2} (x_k - \mu_i) \cdot \Sigma_i^{-1} \cdot (x_k - \mu_i)],    (5)

where we always shift τ_k to the interval of length 2π centered at the τ-component of the mean μ_i.

The probability P(C_i|x_k) satisfies the normalization condition Σ_{i=1}^{c} P(C_i|x_k) = 1. The updated parameters
for the mixture model are estimated by the EM method as

n_i = \sum_k P(C_i|x_k), \quad \pi_i = n_i / n, \quad n = \sum_i n_i,    (6)

\mu_i = \frac{1}{n_i} \sum_k P(C_i|x_k) x_k,    (7)

\Sigma_i = \frac{1}{n_i} \sum_k P(C_i|x_k) (x_k - \mu_i)(x_k - \mu_i)^T.    (8)
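
The parameter updates (6)-(8) can be sketched as a single function that takes the posterior probabilities as given. This is an illustration only: the name `m_step` is hypothetical, the posteriors are assumed to be normalized over categories for each point, and the periodic shift of τ toward each category mean is assumed to have been applied to the points beforehand.

```python
def m_step(points, resp):
    """Update priors, means and covariances from posteriors.

    points: list of 3-vectors x_k = (theta, tau, theta').
    resp:   resp[k][i] = P(C_i | x_k), each row summing to 1.
    Returns (priors, means, covariances) for the c categories.
    """
    c = len(resp[0])
    counts = [sum(row[i] for row in resp) for i in range(c)]   # n_i
    n = sum(counts)                                            # n = sum_i n_i
    priors, means, covs = [], [], []
    for i in range(c):
        n_i = counts[i]
        mu = [sum(row[i] * x[a] for row, x in zip(resp, points)) / n_i
              for a in range(3)]
        sigma = [[sum(row[i] * (x[a] - mu[a]) * (x[b] - mu[b])
                      for row, x in zip(resp, points)) / n_i
                  for b in range(3)]
                 for a in range(3)]
        priors.append(n_i / n)
        means.append(mu)
        covs.append(sigma)
    return priors, means, covs
```

Alternating this update with a recomputation of the posteriors from (5) gives one full EM iteration.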

Generally, the objective function for optimizing the mixture model is

\mathrm{Prob}(\{x_k\}) = \prod_k \sum_i \pi_i P(x_k|C_i) \propto \prod_k \sum_i P(C_i|x_k).    (9)

However, when we convert point x_k to its structural code i*, we use

i^* = \arg\max_i P(C_i|x_k).    (10)

An alternative objective function would be

Q(\{x_k\}) = \prod_k \max_i P(C_i|x_k).    (11)

When starting with narrow distributions for Σ_i, a very high value of Q could be seen at the first step.
However, after just one step of the EM iteration Q drops significantly, and then increases at later steps.
While Prob({x_k}) never decreases, Q will decrease after reaching its maximum. We may stop the model
training before Q decreases again. Thus, the optimization here is a compromise between Prob({x_k}) and
Q({x_k}).

Once we have the model, we may convert a structure to its conformational code sequence according to
(10). Although no effect from the connection of states is directly considered, the model gains the advantage
of being able to assign codes to short fragments easily.
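
To illustrate the assignment rule (10), the sketch below scores a point against three of the 17 states, using the prior, |Σ|^{-1/2}, mean and inverse-covariance entries that Table 1 lists for H, E and G. Restricting the alphabet to three states is purely for brevity, and the names `STATES`, `log_score` and `assign_code` are hypothetical. Scores are compared in log form, and τ is shifted to the 2π interval centered at each state's mean.

```python
import math

# state: (prior in %, |Sigma|^{-1/2}, mean (theta, tau, theta'),
#         inverse covariance entries (tt, tau-t, tau-tau, t'-t, t'-tau, t'-t'))
STATES = {
    "H": (16.2, 10425.0, (1.55, 0.88, 1.55), (706.6, -93.9, 245.5, 128.9, -171.8, 786.1)),
    "E": (11.6, 109.0, (1.02, -2.98, 0.95), (34.3, 4.2, 15.2, -9.3, -22.5, 56.8)),
    "G": (5.6, 133.0, (1.49, 2.09, 1.05), (163.9, 0.6, 3.8, 2.0, -3.7, 32.3)),
}

def wrap(delta):
    """Map an angular difference into [-pi, pi)."""
    return (delta + math.pi) % (2.0 * math.pi) - math.pi

def log_score(x, state):
    """log of pi_i |Sigma_i|^{-1/2} exp(-q/2), up to a constant common to all states."""
    prior, inv_sqrt_det, mean, (a00, a01, a11, a02, a12, a22) = state
    d0 = x[0] - mean[0]
    d1 = wrap(x[1] - mean[1])          # tau shifted toward the state's mean
    d2 = x[2] - mean[2]
    q = (a00 * d0 * d0 + a11 * d1 * d1 + a22 * d2 * d2
         + 2.0 * (a01 * d0 * d1 + a02 * d0 * d2 + a12 * d1 * d2))
    return math.log(prior) + math.log(inv_sqrt_det) - 0.5 * q

def assign_code(x, states=STATES):
    """Return the letter with the largest posterior for the point x."""
    return max(states, key=lambda name: log_score(x, states[name]))
```

Because the normalization of the posterior is common to all states, comparing the unnormalized log scores is enough for the arg max.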



Table 1. The 17 structural states from the mixture model.

                                 μ                                  Σ^-1
State  π(%)  |Σ|^-1/2      θ      τ     θ'      θθ      τθ      ττ     θ'θ     θ'τ    θ'θ'
I       8.2      1881   1.52   0.83   1.52   275.4   -28.3    84.3   106.9   -46.1   214.4
J       7.3      1797   1.58   1.05   1.55   314.3   -10.3    46.0    37.8   -70.0   332.8
H      16.2     10425   1.55   0.88   1.55   706.6   -93.9   245.5   128.9  -171.8   786.1
K       5.9       254   1.48   0.70   1.43    73.8   -13.7    21.5    15.5   -25.3    75.7
F       4.9       105   1.09  -2.72   0.91    24.1     1.9    10.9   -11.2    -8.8    53.0
E      11.6       109   1.02  -2.98   0.95    34.3     4.2    15.2    -9.3   -22.5    56.8
C       7.5       100   1.01  -1.88   1.14    28.0     4.1     6.2     2.3    -5.1    69.4
D       5.4        78   0.79  -2.30   1.03    56.2     3.8     4.2   -10.8    -2.1    30.1
A       4.3       203   1.02  -2.00   1.55    30.5     9.1     8.7     6.0     5.7   228.6
B       3.9        66   1.06  -2.94   1.34    26.9     4.6     4.9     9.5    -5.0    54.3
G       5.6       133   1.49   2.09   1.05   163.9     0.6     3.8     2.0    -3.7    32.3
L       5.3        40   1.40   0.75   0.84    43.7     2.5     1.4    -7.0    -2.9    34.5
M       3.7       144   1.47   1.64   1.44    72.9     2.1     4.8     1.9    -7.9    72.9
N       3.1        74   1.12   0.14   1.49    25.3     3.2     3.1     9.9     0.9    83.0
O       2.1       247   1.54  -1.89   1.48   170.8    -0.7     3.7    -4.1     3.1    98.7
P       3.2       206   1.24  -2.98   1.49    48.0     8.2     7.3    -4.9    -6.6   155.6
Q       1.7        25   0.86  -0.37   1.01    28.4     1.5     1.2     3.4     0.1    19.5

3 Result 

To establish the discrete structural states by training the mixture model, we create a nonredundant set of
1544 non-membrane proteins from PDB_SELECT (Hobohm and Sander, 1994) with amino acid identity less
than 25%, issued on 25 September 2001. The three-dimensional structures of these proteins
are taken from the Protein Data Bank (PDB). The secondary structures for these sequences are taken from the
DSSP database (Kabsch and Sander, 1983). We consider the reduced 3 secondary structure states {h, e, c}
generated from the 8 states of the DSSP by the coarse-graining H, G, I → h; E → e; and X, T, S, B → c. The
total number of contiguous fragments is 2248, which gives 264,232 points in total in the three-angle phase
space.
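
The 8-to-3 state reduction described above amounts to a lookup table. The following sketch uses hypothetical names (`DSSP_TO_3`, `reduce_dssp`); mapping any symbol outside the eight listed ones to c is an assumption of this example, not something stated in the text.

```python
# Coarse-graining of the 8 DSSP states: H, G, I -> h; E -> e; X, T, S, B -> c.
DSSP_TO_3 = {"H": "h", "G": "h", "I": "h",
             "E": "e",
             "X": "c", "T": "c", "S": "c", "B": "c"}

def reduce_dssp(dssp_string):
    """Convert a DSSP state string to the reduced {h, e, c} alphabet.

    Symbols not in the table are mapped to 'c' (assumption of this sketch).
    """
    return "".join(DSSP_TO_3.get(symbol, "c") for symbol in dssp_string)
```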



Table 2. The percentages of each secondary structure in the structural states.

         I   J   H   K   F   E   C   D   A   B   G   L   M   N   O   P   Q  Counts
cccc     3   3   1   5   3   7  14   3   7   3  10   9   5   7   4   5   3   25090
ccce     -   1   -   3   3   6  15   2   5   4  24   9   4   4   4   4   3    3272
ccch     2   1   1   6   5   8  22   4   3   2   9  11   7   4   1   3   2    3028
ccee     -   -   -   -   7  20  17   7   -   1  15  24   -   -   -   -   2    4029
cchh     1   3   -   1   -   -   1   -  46   1   -   -  17   4   -  18   1    3664
ceec     -   -   -   -   6  36  26  14   -   8   -   2   -   -   -   -   3     620
ceee     -   -   -   -   7  43  12  22   -   4   -   2   -   -   -   2   3    3676
ceeh     -   -   -   -   3  11  49  19   -   3   -   -   -   -   1   1   7      51
chhh    21  38  28   4   -   -   -   -   -   -   -   -   4   2   -   -   -    4353
eccc     3   3   1   3   2   6  15   3  14   3   5  11   4   7   2   7   4    3007
ecce     3   1   1   6   1   4   5   1   1   1   1   4   1  26  35   -   1     492
ecch     1   1   -   5   1   6  18   5   4   1   9  24   5   6   -   3   2     258
ecee     1   -   -   -   5  18  16  10   1   2   3  33   1   -   -   2   3      80
echh     -   1   -   -   -   -   -   -  52   1   1   -  10   5   -  17   5     256
eecc     -   -   -   -   6  16  19   7  12   6   -   3   -   9   -  11   4    3807
eece     -   -   -   -   6  18  21  11  11   6   -   3   1  11   -   6   2      80
eech     -   -   -   -   4  15  25  16   2   5   -   3   -   6   -  10  10     256
eeec     -   -   -   -   7  36  19  14   1   9   -   6   -   -   -   1   3    3596
eeee    (entries illegible in the source)
eeeh     -   -   -   -   4  15  41  17   2  11   -   2   -   -   -   1   4     197
eehh     -   -   -   -   -   -   3   -  57   2   -   -   -   2   1  28   4     248
ehhh    13  43  25   3   -   -   -   -   -   -   -   -   6   5   1   -   -     248
hccc     4   5   2   4   2   4   7   1   4   2  14   6   6   5  18   6   1    3254
hcce     1   2   -   3   5   5  11   -   7   5  22   7   4   4   6   7   3     208
hcch     3   -   1   4   3   5  21   5   4   2  14   9   7   4   4   3   1     328
hcee     -   -   -   -   6  21  18   9   -   4  17  14   1   -   -   -   2     151
hchh     1   2   -   1   -   -   1   -  12   1   -   -  27  18   -  31   -     356
heec    (row incomplete in the source; surviving entries, in order: 50, 15, 10, -, 10, ..., 5, -)      20
heee     -   -   -   -   4  50   4  17   1   6   -   3   -   -   -   8   1     117
hhcc     ?  15  11   9   1   -   -   -   1   -  15  10   9   1  10   2   -    3861
hhce     1   1   -   1   2   -   2   -   1   1  43  15   6   -  14   3   -     151
hhch     8   4   3  19   3   -   1   -   -   -  15  28   7   2   3   -   -     356
hhee     -   -   -   -   -   -   -   -   -   -  52  40   3   -   -   -   -     137
hhhc    23  21  25  20   -   -   -   -   -   -   -   2   2   3   1   -   -    4464
hhhe    29  30  15  11   -   -   -   -   -   -   -   4   4   2   2   -   -     137
hhhh    21  11  60   4   -   -   -   -   -   -   -   -   1   -   -   -   -   31327

(A dash marks an empty cell; ? marks an entry lost in the source.)



3.1 The discrete structural states 

The marginal one-dimensional distribution of the pseudobond bending angle has two prominent peaks around
θ = 1.10 and 1.55 (radians). Non-zero densities of θ are confined to the interval [0.4, 1.9]. The marginal one-dimensional
distribution of the torsion angle τ has one immediately noticeable peak at τ = 0.87 (corresponding to the helix).
Another peak at τ = −2.94 is less prominent. There is a vague peak still recognizable around τ = −2.00. A
grid generated with θ ∈ {1.00, 1.55} and τ ∈ {−2.80, −2.05, −1.00, 0.00, 0.87} is used to search the high-dimensional
phase space for density peaks by the downhill simplex method. In the box counting, the box size is
taken from 0.1 to 0.2 for θ, and the width for τ is twice that for θ. The helices seen as a single peak in
the three-angle phase space are clearly identified as several sub-peaks in the five-angle phase space. Further
exploring the main peaks in the five-angle phase space, we identify 17 mode centers, which are then used as
the main initial parameters to train the mixture model. Finally, the 17 structural states are obtained for
the mixture model by the EM algorithm. They are listed in Table 1. Note that it is the entries of the
inverse covariance matrix that are given. The determinant of the matrix is a measure of the divergence of
the corresponding mode. The sharpest state is H, while the vaguest state is Q, which also occupies the
smallest proportion of phase points.



Table 3. The forward transition rates (multiplied by 100) between structural states. 
12


15 
22
17
1


3 


1 


10 
1
14


8 
2
2
12


26 
8
13
2
2
12
4


9 


28 


8 


4 


2 


10 


1 
11


7 
2
22


12 


9 





1 
14
8


11 


2 


3 


2 





2 


2 


10 


13 


9 


3 


3 


8 
3


5 


2 


4 


3 


2 


2 
1


35 


8 


13 


3 


13 


4 





O 


1 


1 





2 


2 


2 


1 








3 


56 
17


2 


4 


4 





P 


12 


18 


7 


7 


2 


1 


1 











12 


23 
1


4 


3 
3


16 


15 


8 


10 


5 


1 
1


20 





3 


13 



3.2 The structure alphabet and the secondary structure

Our 17 structural states, or letters of the structural alphabet, describe the local structure of four-residue
segments, and a code is assigned to the third residue of the unit of four residues. The total number of
possible four-residue secondary structures is 37: the restriction to minimal lengths of 2 for e and 3 for h
removes 44 quartets from the total 3^4 = 81.
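
The count of 37 can be checked by enumeration. The sketch below assumes one particular reading of the length restriction: only a run that lies strictly inside the four-residue window (touching neither end) is tested, since a run touching an end may continue outside the window. Under that assumption, and with the hypothetical names `MIN_LEN` and `allowed`, the enumeration reproduces the figure quoted above.

```python
from itertools import groupby, product

MIN_LEN = {"h": 3, "e": 2, "c": 1}   # minimal run lengths per state

def allowed(quartet):
    """True if no run strictly inside the window is shorter than its minimal length."""
    pos = 0
    for state, run in groupby(quartet):
        length = len(list(run))
        start, end = pos, pos + length
        pos = end
        interior = start > 0 and end < len(quartet)
        if interior and length < MIN_LEN[state]:
            return False
    return True

quartets = ["".join(q) for q in product("hec", repeat=4) if allowed("".join(q))]
removed = 3 ** 4 - len(quartets)
```

Here `len(quartets)` evaluates to 37 and `removed` to 44; the list includes heeh and excludes, for example, chhc.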

In order to make a detailed comparison between the secondary structures and the discrete structural
states, we extract from the training set a subset which contains 676 fragments and 118,621 residues (hence
116,593 points in the three-angle phase space). We arrange the corresponding counts in Table 2. (Secondary
structure heeh has zero count, so it is omitted.) The table shows the percentages of each secondary structure
in the structural states. It is clearly seen that there exists a correlation between the two types of structure
classification. For example, from Table 2, hhhh is mainly attributed to H, I and J, while eeee to E and
D. The mutual information between the conformational codes and the secondary structure states equals
0.731. In Table 2, the row cccc shows rather uniform percentages in different structural states, as we would
expect.
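
For reference, the mutual information of two discrete labelings can be computed from a joint count table as sketched below. The function name `mutual_information` is hypothetical, and the use of base-2 logarithms (bits) is an assumption, since the text does not state the unit of the value 0.731.

```python
import math
from collections import Counter

def mutual_information(joint_counts):
    """Mutual information (in bits) from counts keyed by (code, secondary_structure)."""
    total = float(sum(joint_counts.values()))
    row, col = Counter(), Counter()
    for (a, b), n in joint_counts.items():
        row[a] += n
        col[b] += n
    mi = 0.0
    for (a, b), n in joint_counts.items():
        if n > 0:
            # p(a,b) * log2( p(a,b) / (p(a) p(b)) )
            mi += (n / total) * math.log2(n * total / (row[a] * col[b]))
    return mi
```

Two perfectly correlated binary labelings give 1 bit, and two independent labelings give 0.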

3.3 Transition between structural states 

Any two sequential points (θ_{j−1}, τ_j, θ_j) and (θ_j, τ_{j+1}, θ_{j+1}) share the common angle θ_j. The effect of the
connection of sequential structural states is reflected in the transition rates between structural states. We first convert
the 3D structures of the training set to their structure code sequences, and then determine the transition
rates by counting code pairs. The obtained rates are listed in Table 3.
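
Counting code pairs and normalizing by row can be sketched as follows. The function name `transition_rates` and the toy sequences in the usage note are hypothetical.

```python
from collections import Counter

def transition_rates(code_sequences):
    """Forward transition rates P(next = b | current = a) from adjacent code pairs."""
    pair_counts = Counter()
    for seq in code_sequences:
        pair_counts.update(zip(seq, seq[1:]))      # adjacent (i, i+1) pairs
    row_totals = Counter()
    for (a, _b), n in pair_counts.items():
        row_totals[a] += n
    return {(a, b): n / row_totals[a] for (a, b), n in pair_counts.items()}
```

For the toy input `["HHHG", "HHG"]` the pairs are HH (three times) and HG (twice), so the rate for H to H is 0.6 and for H to G is 0.4.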

The entries of Table 3 are the forward transition rates, i.e. the conditional probability of the state at the (i + 1)-th site
of a state chain given the state at the i-th site. Normalized by row, the table tells where a row state tends
to go. Extended states, e.g. H and E, are characterized by large diagonal elements, while transient
states, e.g. A and G, have almost vanishing diagonal rates. From the table, we may trace the capping states
for the helix and β-strand. For example, A is an important mode which leads to the helix, and G is a main
leaving mode for the helix.

Table 4. CLESUM: The conformation letter substitution matrix (in the unit of 0.05 bit). 
-13


12 


-14 
-33


-58 


-55 


-35 


-4 


6 


-14 


3 


7 


41 
-22


-43 


-39 


-17 


10 


13 


-12 


-7 


-2 


19 


34 
-23


-54 


-37 


5 


14 


-13 


-5 


-2 


5 


-12 


2 


23 
-42


-75 


-59 


-32 


-5 


27 


-2 


-6 


-12 


5 


4 


12 


1 


51 
-6


-27 
2
-95


-67 
-6
-2


-22 


-31 


19 


24 
-122
-81


-45 


13 


-24 


-32 


-50 


11 


-11 


-19 


-43 


19 


21 


20 


49 

3.4 Structural substitution matrix

Sequence alignment is the main procedure for comparing sequences. Certain amino acid substitutions
commonly occur in related proteins from different or the same species. Amino acid substitution matrices, extracted
from our knowledge of the most and least common changes in a large number of proteins, serve the purpose
of sequence alignment. The popular BLOSUM matrix of Henikoff and Henikoff (1992) is derived from a large
set of conserved amino acid patterns without gaps representing various families. The frequencies of amino
acid substitutions are counted in these sequence alignments. These frequencies are then divided by
the expected frequency of finding the amino acids together in an alignment by chance. The ratio of the
observed to the expected counts is an odds score. The BLOSUM entries are logarithms of the odds scores
with base 2, multiplied by a scaling factor of 2.

To use our structural codes directly for structural comparison, a score matrix similar to BLOSUM
is desired. There is a database of aligned structures, the FSSP of Holm and Sander (1997), which is
based on an exhaustive all-against-all 3D structure comparison of protein structures in the PDB. The proteins
in the FSSP are divided into a representative set and sequence homologs of the representative set. The
representative set contains no pair with more than 25% sequence identity. In the version of Oct 2001,
there are 2,860 sequence families representing 27,181 protein structures. A tree for the fold classification of
the representative set is constructed by a hierarchical clustering method based on the structural similarities.
Family indices of the FSSP are obtained by cutting the tree at levels of 2, 4, 8, 16, 32 and 64 standard
deviations above the database average. We convert the structures of the representative set to their structural
code sequences. All the pair alignments of the FSSP for the proteins with the same first three family indices
in the representative set are collected for counting aligned pairs of structural codes. The total number of code
pairs is 1,143,911. The substitution matrix, derived in the same way as BLOSUM, is shown
in Table 4, where a scaling factor of 20 instead of 2 is used to show more details. We call this conformation
letter substitution matrix CLESUM. Henikoff and Henikoff (1992) introduced for their BLOSUM the average
mutual information per amino acid pair H, which is the Kullback-Leibler distance between the joint model
of the alignment and the independent model. The value of H for our CLESUM equals 1.05, which is close
to that for BLOSUM83.
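
The log-odds construction can be sketched from a table of aligned pair counts as below. This follows the standard BLOSUM recipe as described by Henikoff and Henikoff; the function name `log_odds`, the treatment of pairs as unordered, and the toy counts in the usage note are assumptions of this example and do not reproduce the actual FSSP counts.

```python
import math
from collections import Counter

def log_odds(pair_counts, scale=20):
    """Scaled log-odds scores and average mutual information H (bits per pair).

    pair_counts: counts of aligned letter pairs; (a, b) and (b, a) are pooled.
    scale=20 gives scores in units of 0.05 bit.
    """
    pooled = Counter()
    for (a, b), n in pair_counts.items():
        pooled[tuple(sorted((a, b)))] += n
    total = float(sum(pooled.values()))
    q = {pair: n / total for pair, n in pooled.items()}        # observed frequencies
    p = Counter()                                              # marginal letter frequencies
    for (a, b), f in q.items():
        if a == b:
            p[a] += f
        else:
            p[a] += f / 2.0
            p[b] += f / 2.0
    scores, info = {}, 0.0
    for (a, b), f in q.items():
        expected = p[a] * p[b] * (1.0 if a == b else 2.0)
        bits = math.log2(f / expected)
        scores[(a, b)] = round(scale * bits)
        info += f * bits
    return scores, info
```

With the toy counts `{("A", "A"): 40, ("B", "B"): 40, ("A", "B"): 20}` the diagonal scores are 14 and the off-diagonal score is -26.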

4 Discussion

Biologically important modules have been repeatedly employed in protein evolution by gene duplication
and rearrangement mechanisms. They form components of fundamental units of structure and function.
The presence of modules provides a guide to classifying proteins into module-based families, and helps
structure prediction. The existence of such conserved recurrent segments sets a solid foundation for
local analysis. We have discretized the combination of three pseudobond angles formed by four consecutive
Cα atoms to convert the local geometry to 17 coarse-grained conformational letters according to a mixture
model of the angle distribution.



4.1 The precision of the conformational codes 

From the correlation between the conformational codes and the secondary structures (Table 2), it is not
surprising that there exists a propensity of the codes to amino acids. The coarse-graining would introduce
an error. It is then important to examine the precision of the codes. For this purpose, we randomly pick
1,000 points for each code, and calculate the distance root mean squared deviation (drms) for each of the
total 499,500 pairs from their coordinates. The drms of structures a and b, which does not require a structure
alignment, is defined as the averaged distance pair difference

\mathrm{drms} = \left[ \frac{2}{n(n-1)} \sum_{i=2}^{n} \sum_{j=1}^{i-1} \left( |r_{ai} - r_{aj}| - |r_{bi} - r_{bj}| \right)^2 \right]^{1/2},    (12)

where r_{ai} is the coordinate of atom i in structure a. The averaged coordinate pair difference, i.e. the
coordinate root mean squared deviation crms, is about 1.2 times the drms.
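
A direct transcription of the drms into code might look as follows. It is a sketch under the assumption that the summand is the squared difference of corresponding intra-structure distances, as in the usual definition of the distance rms; the names `_dist` and `drms` are hypothetical.

```python
import math

def _dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def drms(struct_a, struct_b):
    """Distance rms deviation between two equally long lists of atom coordinates."""
    n = len(struct_a)
    if n != len(struct_b) or n < 2:
        raise ValueError("structures must have the same length, at least 2")
    total = 0.0
    for i in range(1, n):
        for j in range(i):
            diff = _dist(struct_a[i], struct_a[j]) - _dist(struct_b[i], struct_b[j])
            total += diff * diff
    return math.sqrt(2.0 * total / (n * (n - 1)))
```

Because only internal distances enter, the value is unchanged when either structure is translated or rotated, which is why no superposition is needed.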

The errors of the conformational codes are listed in Table 5. The most precise code H has an error of
0.133 ± 0.060 Å, while the vaguest code L has an error of 0.604 ± 0.365 Å. After averaging over the relative
frequencies of the codes, the mean error is 0.330 Å.

Table 5. The errors of the conformational codes.

Conformational code         I      J      H      K      F      E      C      D
Mean drms (Å)           0.244  0.246  0.133  0.452  0.398  0.307  0.392  0.262
Standard deviation (Å)  0.110  0.124  0.060  0.219  0.287  0.173  0.218  0.149

Conformational code         A      B      G      L      M      N      O      P      Q
Mean drms (Å)           0.347  0.322  0.390  0.604  0.481  0.551  0.538  0.252  0.506
Standard deviation (Å)  0.163  0.197  0.192  0.365  0.231  0.321  0.318  0.134  0.287



4.2 The connection effect of sequential states 

Compared with the HMM, the mixture model does not include the connection effect of sequential states. 

The parameter number increases quadratically with the number of categories for a Markov model, while only 
linearly for a mixture model. We have to compromise between precision and correlation. A mixture model 
with fine categories is also promising. 

Since the model training involves a global optimization, the choice of a good initial trial plays an important
role. A careful exploration of the density distribution in the five-angle space corresponding to two consecutive
conformational states reveals that the peaks in the five-angle space give a finer picture of the peaks in the
three-angle space. That is, subpeaks in the three-angle space are easily recognizable from peaks in the
five-angle space. We have identified 17 intense peaks, which survive the later process of model training.
Camproux et al. (1999) found 12 modes for the four-residue unit by an HMM. Instead of angles, they used
a combination of four distances. Since only three of the four are independent, their mode centers need not
correspond to a real conformation. However, we can still see the correspondence between their codes and
ours: ai-H, a-i-J , a'-K, a'_-0, a\-M, ji-N, 72-P, •y^-Q, 7/3a-^) lap-G, 132-D, and /3i-£'. 
4.3 Structure alignment via conformational codes 

The conversion of a 3D structure of coordinates to its conformational codes requires little computation. To 
distinguish it from the amino acid sequence, we call the converted code sequence the code series, or simply series. 
Once we transform 3D structures to 1D series, structure comparison becomes series comparison, and 
tools for analyzing ordinary sequences can be applied directly. We have constructed the conformational 
letter substitution matrix CLESUM from the alignments of the FSSP database. We shall examine the 
performance of the conformational alphabet derived above. 
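
The geometric part of the conversion is indeed light. The sketch below computes, for each window of four consecutive Cα atoms, two bending angles and one torsion angle. The ordering (bend, torsion, bend), the sign convention of the torsion, and all function names are assumptions made for illustration; assigning a letter would additionally require the trained mixture parameters, which are not reproduced here.

```python
import math


def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def bend_angle(a, b, c):
    """Angle at b between the pseudobonds b-a and b-c, in radians."""
    u, v = _sub(a, b), _sub(c, b)
    cos_angle = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.acos(max(-1.0, min(1.0, cos_angle)))  # clamp rounding noise


def torsion_angle(a, b, c, d):
    """Dihedral angle about the pseudobond b-c, in radians, in (-pi, pi]."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    x = _dot(n1, n2)
    y = _dot(_cross(n1, n2), b2) / math.sqrt(_dot(b2, b2))
    return math.atan2(y, x)


def pseudobond_angles(ca):
    """Return (bend, torsion, bend) for every four consecutive C-alpha atoms."""
    return [(bend_angle(ca[i], ca[i + 1], ca[i + 2]),
             torsion_angle(ca[i], ca[i + 1], ca[i + 2], ca[i + 3]),
             bend_angle(ca[i + 1], ca[i + 2], ca[i + 3]))
            for i in range(len(ca) - 3)]
```

A chain of n residues yields n - 3 angle triples, each of which would then be mapped to the conformational letter of highest posterior probability.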

Table 6. The alignment of 1urnA and 1ha1. The first two lines are their amino acid sequences 
aligned according to the FSSP, while the last two lines are the global Needleman-Wunsch alignment 
of the conformational code series. Lowercase letters of amino acids indicate structural nonequivalence. 

1urnA avpetRPNHTIYINNLMEKIKKDELKKSLHAIFSRFGQILDILVSRS 
1ha1b ahLTVKKIFVGGIKEDT EEHHLRDYFEQYGKIEVIEIMTDRGS 
      CCPMCEALEEEENGCPJGCCIHHHHHHHHIKMJILQEPLDEEEBGAIK 
      . . .BBEBGEDEEMMFMMLFA HHHHHKKMJJLCEBLDEBCECAKK 

1urnA LKMRGQAFVIFKEVSSATNALRSMqGFPFYDKPMRI QYAKTDSDI I AKM 
1ha1b GKKRGFAFVTFDDHDSVDKIVIQ kYHTVNGHNCEVRKAL 
      . . . GNGEDBEEALAJHHHHHHIKKGNGCENOGCCEFECCALCCAHI JH 
      AGCPOLEDEEEALB JHHHHI . I JGALEEENOGBFDEECC 

Holm and Sander (1998) gave an example of the α/β-meander cluster with four members showing different 
levels of structural similarity. Their PDB-IDs are 1urnA, 1ha1, 2bopA and 1mli. The structure of 1urnA 
was taken as the frame on which to superimpose the other structures. In order of decreasing structural 
similarity to 1urnA, they are 1ha1, 2bopA and 1mli. Taking the scaling factor for the CLESUM to be 2, and using -12 for 
the gap-opening penalty and -4 for the gap extension, the global Needleman-Wunsch alignment of 1urnA 
and 1ha1 is shown in Table 6, where, in the first two lines, the amino acid sequences aligned according to 
the FSSP are also given. It is seen that, except for segment boundaries, the two alignments coincide. The 
alignment of the FSSP and the code series alignment for 1urnA and 2bopA have three common segments 
falling in positive score regions of the series alignment. In the alignments for 1urnA and 1mli, two common 
segments longer than 8 are still seen. As for the amino acid sequence alignment, in the case of 1urnA and 
1ha1 two segments of lengths 13 and 21 of the sequence alignment coincide with the FSSP, but no coincidence 
is seen in the other two cases. 
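
For readers who want to reproduce this kind of series alignment, the following is a minimal sketch of a global alignment with affine gap penalties (the three-state dynamic programming usually attributed to Gotoh). It assumes that a gap of length k costs gap_open + (k - 1) * gap_extend, which is one common convention and may differ from the one used here. The `score` argument stands in for a lookup in the scaled CLESUM, whose entries are not listed in this section; `toy_score` is a hypothetical placeholder, not CLESUM.

```python
def global_align(s, t, score, gap_open=-12, gap_extend=-4):
    """Global alignment of strings s and t with affine gap penalties.

    Returns (total score, aligned s, aligned t), with '-' marking gaps.
    """
    n, m = len(s), len(t)
    neg = float("-inf")
    # val[k][i][j]: best score for s[:i] vs t[:j] whose last column is
    # k=0: s[i-1] paired with t[j-1]; k=1: s[i-1] over a gap; k=2: a gap over t[j-1]
    val = [[[neg] * (m + 1) for _ in range(n + 1)] for _ in range(3)]
    ptr = [[[None] * (m + 1) for _ in range(n + 1)] for _ in range(3)]
    val[0][0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                k = max(range(3), key=lambda q: val[q][i - 1][j - 1])
                val[0][i][j] = val[k][i - 1][j - 1] + score(s[i - 1], t[j - 1])
                ptr[0][i][j] = k
            if i > 0:
                cands = [val[q][i - 1][j] + (gap_extend if q == 1 else gap_open)
                         for q in range(3)]
                k = max(range(3), key=cands.__getitem__)
                val[1][i][j], ptr[1][i][j] = cands[k], k
            if j > 0:
                cands = [val[q][i][j - 1] + (gap_extend if q == 2 else gap_open)
                         for q in range(3)]
                k = max(range(3), key=cands.__getitem__)
                val[2][i][j], ptr[2][i][j] = cands[k], k
    # Trace back from the best final state to the origin.
    k = max(range(3), key=lambda q: val[q][n][m])
    total, i, j, top, bottom = val[k][n][m], n, m, [], []
    while i > 0 or j > 0:
        prev = ptr[k][i][j]
        if k == 0:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif k == 1:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
        k = prev
    return total, "".join(reversed(top)), "".join(reversed(bottom))


def toy_score(a, b):
    """Hypothetical placeholder for a CLESUM lookup: +5 match, -3 mismatch."""
    return 5 if a == b else -3
```

For example, `global_align("HHHHCCEEEE", "HHHHEEEE", toy_score)` places a single two-letter gap opposite `CC`. With the real matrix, `score` would return the CLESUM entry for the two conformational letters.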

The conformational codes are local. Even though a global alignment algorithm is used, there is no 
guarantee that the alignment found corresponds to the optimal structure superposition. However, the code 
series alignment is not affected by domain movement, and is therefore well suited to analyzing structural evolution. 
For example, the first helix of 1ha1 is shorter than its counterpart in 1urnA by one turn. The FSSP aligns 
the N-cap (with codes FA) of the 1ha1 helix to the helix (with codes HH) of 1urnA, but the local structure 
FA is closer to CC (with positive scores) than to HH (with negative scores). 

The CLESUM includes only structural information. When we compare two structures, we usually 
also know their amino acid sequences. Many papers have considered a linear combination of a structural alignment 
score and a sequence alignment score, which amounts to assuming that the two are independent. From the FSSP, it is possible 
to construct a substitution matrix in the joint space of structure and sequence. However, such a matrix 
would have about 6 x 10^4 parameters. When the structure is to be emphasized, we may use a reduced 
amino acid alphabet (Zheng, 2004). For example, clustering the 20 amino acids into 3 groups would reduce the 
number of parameters to about 10^3. We often want to compare a sequence of unknown structure with a known 
structure. In this case, a rectangular substitution matrix of the type (amino acid) x (conformational code) 
to (amino acid) is useful. The construction of these matrices is our next task. 
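
The orders of magnitude quoted above follow from counting the independent entries of a symmetric substitution matrix over the joint alphabet. The short sketch below assumes a symmetric matrix with n(n + 1)/2 independent entries for n joint states; the function names are illustrative.

```python
def symmetric_entries(n_states):
    """Independent entries of a symmetric n x n substitution matrix."""
    return n_states * (n_states + 1) // 2


def rectangular_entries(n_rows, n_cols):
    """Entries of a rectangular (non-symmetric) substitution matrix."""
    return n_rows * n_cols


N_CODES = 17   # conformational letters
N_AMINO = 20   # amino acid types
N_GROUPS = 3   # reduced amino acid alphabet

full_joint = symmetric_entries(N_CODES * N_AMINO)       # 340 joint states
reduced_joint = symmetric_entries(N_CODES * N_GROUPS)   # 51 joint states
seq_to_struct = rectangular_entries(N_CODES * N_AMINO, N_AMINO)
```

This gives 57970 entries for the full joint matrix and 1326 for the reduced one. Under the same counting, the rectangular matrix pairing a (amino acid, code) state with a bare amino acid has 6800 entries.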

It is known that sequence-structure relationships are not always strong. Bystroff and Baker 
(1998) have built a library of structure-sequence motifs, which are expected to correspond to functional 
units recurring in different protein contexts and to be found in different combinations in distantly related or 
functionally unrelated proteins. To identify the structural features that have strong sequence preferences is 
to locate peaks of the density distribution in the joint structure-sequence space. Previously, structure-based 
clustering was a much heavier task than sequence-based clustering, so one had to start with a sequence-based 
clustering and then alternate constantly between the structure and sequence subspaces. It is therefore 
interesting to see whether the library can be improved by clustering directly in the joint structure-sequence 
space with the help of conformational codes. This is under study. 

This work was supported in part by the Special Funds for Major National Basic Research 
Project and the National Natural Science Foundation of China. 